Annex

Data Columns Detailed

Data summary
Name data
Number of rows 45896
Number of columns 26
_______________________
Column type frequency:
character 8
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Make 0 1 3 34 0 141 0
Model 0 1 1 47 0 4762 0
Fuel.Type.1 0 1 6 17 0 6 0
Fuel.Type.2 0 1 0 11 44059 5 0
Drive 0 1 0 26 1186 8 0
Engine.Description 0 1 0 46 17031 590 0
Transmission 0 1 0 32 11 41 0
Vehicle.Class 0 1 4 34 0 34 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1.00 23102.11 13403.10 1.00 11474.75 23090.50 34751.25 46332.00 ▇▇▇▇▇
Model.Year 0 1.00 2003.61 12.19 1984.00 1992.00 2005.00 2015.00 2023.00 ▇▆▆▇▇
Estimated.Annual.Petrolum.Consumption..Barrels. 0 1.00 15.33 4.34 0.05 12.94 14.88 17.50 42.50 ▁▇▃▁▁
City.MPG..Fuel.Type.1. 0 1.00 19.11 10.31 6.00 15.00 17.00 21.00 150.00 ▇▁▁▁▁
Highway.MPG..Fuel.Type.1. 0 1.00 25.16 9.40 9.00 20.00 24.00 28.00 140.00 ▇▁▁▁▁
Combined.MPG..Fuel.Type.1. 0 1.00 21.33 9.78 7.00 17.00 20.00 23.00 142.00 ▇▁▁▁▁
City.MPG..Fuel.Type.2. 0 1.00 0.85 6.47 0.00 0.00 0.00 0.00 145.00 ▇▁▁▁▁
Highway.MPG..Fuel.Type.2. 0 1.00 1.00 6.55 0.00 0.00 0.00 0.00 121.00 ▇▁▁▁▁
Combined.MPG..Fuel.Type.2. 0 1.00 0.90 6.43 0.00 0.00 0.00 0.00 133.00 ▇▁▁▁▁
Engine.Cylinders 487 0.99 5.71 1.77 2.00 4.00 6.00 6.00 16.00 ▇▇▅▁▁
Engine.Displacement 485 0.99 3.28 1.36 0.00 2.20 3.00 4.20 8.40 ▁▇▅▂▁
Time.to.Charge.EV..hours.at.120v. 0 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
Time.to.Charge.EV..hours.at.240v. 0 1.00 0.11 1.01 0.00 0.00 0.00 0.00 15.30 ▇▁▁▁▁
Range..for.EV. 0 1.00 2.36 24.97 0.00 0.00 0.00 0.00 520.00 ▇▁▁▁▁
City.Range..for.EV...Fuel.Type.1. 0 1.00 1.62 20.89 0.00 0.00 0.00 0.00 520.80 ▇▁▁▁▁
City.Range..for.EV...Fuel.Type.2. 0 1.00 0.17 2.73 0.00 0.00 0.00 0.00 135.28 ▇▁▁▁▁
Hwy.Range..for.EV...Fuel.Type.1. 0 1.00 1.51 19.70 0.00 0.00 0.00 0.00 520.50 ▇▁▁▁▁
Hwy.Range..for.EV...Fuel.Type.2. 0 1.00 0.16 2.46 0.00 0.00 0.00 0.00 114.76 ▇▁▁▁▁

Data summary

The table below provides an overview of the dataset.

Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
ID 45896 23102 13403 1 11475 34751 46332
Model.Year 45896 2004 12 1984 1992 2015 2023
Estimated.Annual.Petrolum.Consumption..Barrels. 45896 15 4.3 0.047 13 18 43
Fuel.Type.1 45896
... Diesel 1254 3%
... Electricity 484 1%
... Midgrade Gasoline 155 0%
... Natural Gas 60 0%
... Premium Gasoline 14138 31%
... Regular Gasoline 29805 65%
City.MPG..Fuel.Type.1. 45896 19 10 6 15 21 150
Highway.MPG..Fuel.Type.1. 45896 25 9.4 9 20 28 140
Combined.MPG..Fuel.Type.1. 45896 21 9.8 7 17 23 142
Fuel.Type.2 45896
... (empty) 44059 96%
... E85 1513 3%
... Electricity 296 1%
... Natural Gas 20 0%
... Propane 8 0%
City.MPG..Fuel.Type.2. 45896 0.85 6.5 0 0 0 145
Highway.MPG..Fuel.Type.2. 45896 1 6.6 0 0 0 121
Combined.MPG..Fuel.Type.2. 45896 0.9 6.4 0 0 0 133
Engine.Cylinders 45409 5.7 1.8 2 4 6 16
Engine.Displacement 45411 3.3 1.4 0 2.2 4.2 8.4
Time.to.Charge.EV..hours.at.120v. 45896 0 0 0 0 0 0
Time.to.Charge.EV..hours.at.240v. 45896 0.11 1 0 0 0 15
Range..for.EV. 45896 2.4 25 0 0 0 520
City.Range..for.EV...Fuel.Type.1. 45896 1.6 21 0 0 0 521
City.Range..for.EV...Fuel.Type.2. 45896 0.17 2.7 0 0 0 135
Hwy.Range..for.EV...Fuel.Type.1. 45896 1.5 20 0 0 0 520
Hwy.Range..for.EV...Fuel.Type.2. 45896 0.16 2.5 0 0 0 115
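
The two tables above can be cross-checked with simple arithmetic. The following Python sketch is only an illustration and not the code used to produce the tables; the function name `completeness` is hypothetical. It assumes that N in the summary statistics is the row count minus n_missing, that complete_rate is the share of non-missing rows, and that the factor percentages are rounded shares of all rows. The counts are copied from the tables.

```python
N_ROWS = 45896  # number of rows reported in the data summary

# n_missing values copied from the numeric table
n_missing = {"Engine.Cylinders": 487, "Engine.Displacement": 485}

# level counts of Fuel.Type.1 copied from the summary statistics
fuel_type_1 = {
    "Diesel": 1254,
    "Electricity": 484,
    "Midgrade Gasoline": 155,
    "Natural Gas": 60,
    "Premium Gasoline": 14138,
    "Regular Gasoline": 29805,
}


def completeness(n_rows, missing):
    """Return (number of observed rows, share of observed rows)."""
    n_obs = n_rows - missing
    return n_obs, n_obs / n_rows


for name, miss in n_missing.items():
    n_obs, rate = completeness(N_ROWS, miss)
    print(f"{name}: N = {n_obs}, complete_rate = {rate:.2f}")

# The level counts should add up to the number of rows,
# and the rounded shares should match the percentages in the table.
print("sum of Fuel.Type.1 levels:", sum(fuel_type_1.values()))
shares = {level: round(100 * n / N_ROWS) for level, n in fuel_type_1.items()}
print(shares)
```

Under these assumptions the results agree with the tables: N is 45409 and 45411 for the two engine variables, both complete rates round to 0.99, and the Fuel.Type.1 levels sum to 45896.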

Data cleaned overview

Cleaned Dataset

Name Number_of_rows Number_of_columns Character Numeric Group_variables
data_cleaned 42240 18 8 5 None

Cleaned and Reduced Dataset

Name Number_of_rows Number_of_columns Character Numeric Group_variables
data_cleaned_reduced 42061 18 8 5 None

3D Biplot for 6 clusters

Warning in PCA(data_prepared, graph = FALSE): Missing values are imputed by the
mean of the variable: you should use the imputePCA function of the missMDA
package
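
The warning above says that missing values were replaced by the mean of the corresponding variable before the PCA was computed. The Python sketch below illustrates what such a mean imputation does to a single column. It is not the code used in the analysis, the function name `impute_mean` is hypothetical, and the example values are made up for illustration.

```python
import math


def impute_mean(values):
    """Replace missing entries (None or NaN) by the mean of the observed ones."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    if not observed:
        raise ValueError("cannot impute a column without observed values")
    mean = sum(observed) / len(observed)
    return [mean if (v is None or math.isnan(v)) else v for v in values]


# Made-up example column with two missing entries (NOT taken from the dataset)
cylinders = [4.0, 6.0, None, 8.0, float("nan"), 6.0]
imputed = impute_mean(cylinders)
print(imputed)
```

Because every missing entry receives the same value, mean imputation lowers the variance of the variable. This is one reason why the warning recommends the imputePCA function of the missMDA package instead.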

After examining the silhouette plot in the unsupervised learning part, we decided to provide a 3D biplot for 6 clusters, since the elbow plot also suggests 6 as a reasonable candidate for the number of clusters. The biplot shows that the observations can indeed be divided into 6 clusters. Compared with the 3D biplot in the ‘results_unsupervised_learning’ part, cluster 2 can clearly be split into four smaller clusters, which indicates that this cluster is heterogeneous when only 3 clusters are used. With 6 clusters, however, these 4 distinct sub-clusters are more difficult to interpret. This split also explains the second elbow in the elbow plot: the first elbow appears at 3 clusters, but the curve drops steeply again between 5 and 6 clusters, meaning that 4 or 5 clusters would bring little benefit, whereas a 6th cluster could capture additional structure. A 3-cluster solution nevertheless remains meaningful and makes our clustering analysis more interpretable than a 6-cluster one, which is why we selected only 3 clusters for our analysis.
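
The reasoning about the two elbows can be made concrete by looking at how much the within-cluster sum of squares (WSS) decreases each time a cluster is added. The Python sketch below is only an illustration: the WSS values are hypothetical numbers chosen to mimic a curve with elbows at 3 and 6 clusters, not the values obtained in our analysis. The function names and the ratio threshold of 3 are also assumptions.

```python
def wss_drops(wss_by_k):
    """Return {k: reduction in WSS obtained by going from k-1 to k clusters}."""
    ks = sorted(wss_by_k)
    return {k: wss_by_k[prev] - wss_by_k[k] for prev, k in zip(ks, ks[1:])}


def elbow_candidates(wss_by_k, min_ratio=3.0):
    """Flag k as an elbow when the gain from reaching k clusters is at least
    min_ratio times the gain from adding one more cluster."""
    drops = wss_drops(wss_by_k)
    ks = sorted(drops)
    candidates = []
    for k, nxt in zip(ks, ks[1:]):
        if drops[nxt] > 0 and drops[k] / drops[nxt] >= min_ratio:
            candidates.append(k)
    return candidates


# HYPOTHETICAL WSS curve (illustrative only, not results from this report)
wss = {1: 1000.0, 2: 620.0, 3: 400.0, 4: 370.0,
       5: 345.0, 6: 250.0, 7: 235.0, 8: 222.0}

for k, drop in wss_drops(wss).items():
    print(f"k = {k}: WSS reduced by {drop:.0f}")
print("elbow candidates:", elbow_candidates(wss))
```

With these illustrative numbers, adding a 4th or 5th cluster reduces the WSS only slightly, while the 6th cluster brings a noticeably larger reduction. This is the pattern described above: both 3 and 6 are flagged as elbow candidates, and the choice between them is a trade-off between fit and interpretability.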